Let’s simulate some diploid data, basically under the model, in continuous space, and find the covariance matrix between genotypes (coded as fraction of alleles that are reference):
nind <- 1000
nloci <- 1e4
simdata <- sim_data(nind=nind,nloci=nloci)
covmat <- cov( simdata$genotypes )
cols <- rainbow(nind)[rank(simdata$locs[,1])]
pchs <- as.numeric(cut(simdata$locs[,2],breaks=12))
Here’s what the PCs of this un-normalized matrix look like: on the left is the sampling scheme, with colors and point types indicating location on the (geographic) map, and on the right is the first two PCs.
pca <- eigen(covmat)
layout(t(1:2))
plot(simdata$locs, col=cols, pch=pchs, main="locations" )
plot(pca$vectors[,1:2], col=cols, pch=pchs, main="PC map", xlab="PC1", ylab="PC2" )
Also, the first 8 PCs as colors on the sample map:
layout(matrix(1:9,nrow=3))
par(mar=c(2,2,2,0.5)+.1)
plot(pca$values/sum(pca$values),ylab="proportion", main="proportion of variance explained", xlab="PC#")
for (k in 1:8) {
plot( simdata$locs, col=heat.colors(64)[cut(pca$vectors[,k],breaks=64)], main=paste("PC",k,sep='') )
}
That first PC is almost constant, it turns out, and as we might guess from the map above, conveys mean covariance to everyone else:
plot(pca$vectors[,1],rowMeans(covmat),xlab="PC1",ylab="mean covariance")
Our gestalt is that the first PC is constant: since PCs are orthonormal, it should be about \(1/\sqrt{n}\), where \(n\) is the number of samples; the corresponding singular vector in genotype space is the vector of genotype frequencies, and the total amount of variance this explains would be equal to \(n \sigma^2_f\), where \(\simga^2_f\) is the variance in genotype frequency. We can check this:
varfreq <- var( rowMeans(simdata$freqs) )
cat("Variance in allele frequencies: ", varfreq, "\n")
## Variance in allele frequencies: 0.1679013
cat("Top eigenvalue divided by sample size: ", pca$values[1]/nind, "\n")
## Top eigenvalue divided by sample size: 0.1679614
Now, we can get the PCA map we really want in several almost-equivalent ways: ignoring the top eigenvector:
plot(pca$vectors[,2:3],col=cols,pch=pchs)
or doing PCA after subtracting allele means:
npca <- eigen( cov(sweep(simdata$genotypes,1,rowMeans(simdata$genotypes),"-")) )
plot(npca$vectors[,2:3],col=cols,pch=pchs)
or, which is exactly equivalent to the previous version, doing PCA on the covariance matrix pre- and post-multiplied by the projection matrix:
pmat <- diag(nind)-1/nind
npca2 <- eigen( pmat%*%covmat%*%pmat )
plot(npca2$vectors[,2:3],col=cols,pch=pchs)
The data simluated above used logit-transformed correlated normals. Next, we could check that the relationship to distance still holds reasonably well. Here’s how the transformation has changed the covariance:
ut <- upper.tri(covmat,diag=FALSE)
plot( simdata$covmat[ut], covmat[ut], xlab="untransformed covariance", ylab="observed covariance" )
Hm. The relationship is not simple.
Here is the IBD plot, with colors denoting environmental distance, and the curves showing the covariance (of the untransformed frequencies) using the true parameters, at environmental distance 0 and 1 (the same, now, as we have \(\alpha_E=0\)):
geodist <- rdist(simdata$locs)
envdist <- rdist(simdata$env)
gvals <- seq(0,max(geodist),length.out=100)
pred.cov <- function (D,E,diag=FALSE,aD=1,aE=0,a2=1,a0=0.7,nugget=0.2) {
exp(-sqrt(aD*D^2+aE*E^2)^a2)/a0 + diag*nugget
}
predvals.0 <- pred.cov(gvals,E=0)
predvals.1 <- pred.cov(gvals,E=1)
ecols <- adjustcolor(rainbow(16,start=4/6,end=0),.2)[as.numeric(cut(envdist[ut],breaks=16))]
plot(geodist[ut], covmat[ut], col=ecols, pch=20, ylab="covariance", xlab="geog distance", ylim=range(0,covmat[ut],predvals.0))
lines( gvals, predvals.0, lwd=2 )
lines( gvals, predvals.1, lwd=2, lty=2 )
Hm, better not plot them on the same scale, now including a quick, least-squares fit of covariance to \(\exp(-\alpha_D D)/\alpha_0)\):
layout(t(1:2))
quick.lm <- lm(log(covmat[ut])~geodist[ut])
plot(geodist[ut], covmat[ut], col=ecols, pch=20, ylab="covariance", xlab="geog distance" )
lines(gvals, exp(coef(quick.lm)[2]*gvals+coef(quick.lm)[1]), lwd=2 )
plot( gvals, predvals.0, lwd=2 )
lines( gvals, predvals.1, lwd=2, lty=2 )